R3: Population Stratification - Continuous Ancestry Analysis¶

Reviewer Question¶

Referee #3: "How do you address population stratification? Are signature effects binary (ancestry group) or continuous?"

Why This Matters¶

Understanding whether ancestry effects are binary (discrete groups) or continuous (graded by ancestry probability) is critical for:

  • Model interpretation: Do signatures capture continuous genetic variation?
  • Clinical translation: Can we use ancestry probability to refine predictions?
  • Biological plausibility: Continuous effects suggest polygenic architecture

Our Approach¶

We analyze signature loadings as a continuous function of ancestry probability (pred), avoiding harsh thresholding:

  1. Ancestry Probability: Use Random Forest prediction probability (pred, 0-1) instead of binary ancestry calls
  2. Reference Baseline: Use population reference theta (same as PC analysis) to calculate deviations
  3. Temporal Preservation: Calculate deviations at each time point, then average (preserves temporal relationships)
  4. Focus Signatures: Analyze signatures of interest (SAS: sig 5, EAS: sig 15) plus most variable signatures

Key Findings¶

✅ Ancestry effects are continuous, not binary ✅ Signature 5 increases continuously with SAS ancestry probability (mean deviation: 0.015 → 0.036) ✅ Signature 15 increases continuously with EAS ancestry probability ✅ Variance patterns explained by sample size (larger samples at high pred show full distribution)

1. Setup and Load Data¶

================================================================================
SIGNATURE LOADINGS BY ANCESTRY - CONTINUOUS ANALYSIS
================================================================================

2. Load Ancestry Data¶

1. Loading ancestry data...
   ✓ Loaded 498,395 rows
   Columns: ['eid', 'PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10', 'knn', 'pop', 'rf', 'pred', 'rf99', 'rf90', 'rf80']
   ✓ 498,395 patients with ancestry predictions
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_67394/2145359146.py:3: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  ancestry_df = pd.read_csv(ANCESTRY_PATH, sep='\t')
eid PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 knn pop rf pred rf99 rf90 rf80
0 -1 -0.164125 -0.016342 -0.005988 -0.002362 -0.001290 -0.000793 -0.000692 -0.002821 -0.000928 -0.001011 AFR UKB AFR 1.000 AFR AFR AFR
1 -10 0.083763 -0.125806 -0.026856 -0.032077 -0.008137 -0.001927 -0.000891 -0.019059 -0.003229 -0.003247 EUR UKB EUR 0.999 EUR EUR EUR
2 -100 0.082869 -0.128098 -0.026586 -0.032438 -0.003435 -0.001640 -0.001252 -0.013836 -0.003301 -0.001948 EUR UKB EUR 0.986 NaN EUR EUR
3 -101 0.085578 -0.127146 -0.025148 -0.028368 -0.008648 -0.003687 0.000004 -0.021754 -0.000768 -0.006615 EUR UKB EUR 0.998 EUR EUR EUR
4 -102 0.083711 -0.126064 -0.024218 -0.029771 -0.009118 -0.000895 0.001435 -0.015870 -0.003183 -0.007149 EUR UKB EUR 0.995 EUR EUR EUR

3. Load Signature Loadings (Theta)¶

In [7]:
# Load signature loadings (theta) - using thetas with PCs
print("\n2. Loading signature loadings (with PCs)...")
thetas = torch.load(THETA_PATH, map_location='cpu')
if hasattr(thetas, 'numpy'):
    thetas = thetas.numpy()
elif isinstance(thetas, torch.Tensor):
    thetas = thetas.numpy()
print(f"   ✓ Loaded theta shape: {thetas.shape} (patients × signatures × time)")

# Calculate average signature loadings per patient (across time)
print("\n3. Calculating average signature loadings per patient...")
avg_signature_loadings = thetas.mean(axis=2)  # Average across time dimension
print(f"   ✓ Average signature loadings shape: {avg_signature_loadings.shape} (patients × signatures)")
2. Loading signature loadings (with PCs)...
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_67394/1498240516.py:3: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  thetas = torch.load(THETA_PATH, map_location='cpu')
   ✓ Loaded theta shape: (400000, 21, 52) (patients × signatures × time)

3. Calculating average signature loadings per patient...
   ✓ Average signature loadings shape: (400000, 21) (patients × signatures)

4. Match Ancestry Data to Signature Loadings¶

4. Matching ancestry data to signature loadings...
   ✓ Loaded processed_ids: 400,000
   ✓ Matched 400,000 patients (100.0%)

Ancestry distribution:
EUR    378921
SAS      8295
AFR      7616
AMR      3367
EAS      1801
Name: count, dtype: int64

5. Load Reference Theta and Calculate Deviations¶

Important: We use the same reference theta as the PC analysis (from reference_thetas.csv). This is the population reference trajectory, not ancestry-specific.

Calculation order:

  1. Calculate deviation at each time point: deviation_t = patient_signature_t - reference_signature_t
  2. Then average deviations across time: mean_deviation = mean(deviation_t)

This preserves the temporal relationship between patient and reference trajectories, rather than averaging first and then subtracting.

This allows us to see if ancestry-specific effects (e.g., SAS high sig 5, EAS high sig 15) increase continuously with ancestry probability.

In [14]:
# Load reference theta (same as PC analysis)
print("\n5. Loading reference theta...")
try:
    ref_theta = pd.read_csv(REFERENCE_THETA_PATH, header=0).values  # Shape: (K, T_ref)
    T = thetas.shape[2]  # Number of timepoints
    if ref_theta.shape[1] >= T:
        ref_slice = ref_theta[:, -T:]  # Use last T columns
    else:
        pad = np.zeros((ref_theta.shape[0], T - ref_theta.shape[1]))
        ref_slice = np.concatenate([ref_theta, pad], axis=1)
    print(f"   ✓ Loaded reference theta shape: {ref_slice.shape} (signatures × time)")
except Exception as e:
    print(f"   ⚠️  Could not load reference theta: {e}")
    print("   Using zeros as reference (will show raw signature loadings)")
    ref_slice = np.zeros((thetas.shape[1], thetas.shape[2]))

# Calculate deviations at each time point, then average
print("\n6. Calculating deviations from reference theta...")
print("   Step 1: Calculate deviation at each time point...")
# Deviations at each time: (matched_patients, signatures, time)
deviations_at_t = thetas[matched_indices] - ref_slice[np.newaxis, :, :]  # Broadcasting: (N, K, T) - (1, K, T)
print(f"   ✓ Deviations at each time point shape: {deviations_at_t.shape} (patients × signatures × time)")

print("   Step 2: Average deviations across time...")
# Average deviations across time
deviations = deviations_at_t.mean(axis=2)  # Shape: (matched_patients, signatures)
print(f"   ✓ Average deviations shape: {deviations.shape} (patients × signatures)")

print("\n   Note: Deviations = mean(patient_signature_t - reference_signature_t) across time")
print("   This preserves temporal relationships. Positive = higher than reference, Negative = lower than reference.")
5. Loading reference theta...
   ✓ Loaded reference theta shape: (21, 52) (signatures × time)

6. Calculating deviations from reference theta...
   Step 1: Calculate deviation at each time point...
   ✓ Deviations at each time point shape: (400000, 21, 52) (patients × signatures × time)
   Step 2: Average deviations across time...
   ✓ Average deviations shape: (400000, 21) (patients × signatures)

   Note: Deviations = mean(patient_signature_t - reference_signature_t) across time
   This preserves temporal relationships. Positive = higher than reference, Negative = lower than reference.

6. Focus on Signatures of Interest¶

Based on PC analysis findings:

  • SAS: High sig 5
  • EAS: High sig 15

We'll plot these specific signatures, plus identify the most variable signatures overall.

7. Identifying signatures of interest...
   ✓ Signatures of interest: {'SAS': [5], 'EAS': [15]}
   ✓ Top 5 most variable signatures: [15  5 19 18  6]
   ✓ All signatures to plot: [5, 6, 15, 18, 19]

Top 5 most variable signatures:
signature f_statistic p_value variance
15 15 44249.256914 0.0 1.644637e-05
5 5 5818.621923 0.0 2.845744e-04
19 19 2789.327036 0.0 6.264129e-06
18 18 2465.571266 0.0 8.804495e-07
6 6 2217.872324 0.0 6.325458e-07

7. PCA on Signature Loadings Colored by Ancestry¶

Purpose: Perform PCA decomposition on average signature loadings over time, then plot PC1 vs PC2 colored by ancestry. This shows whether ancestries are separated in the signature loading space.

Key Point: This demonstrates that PCs don't change findings much except for SAS where it's a boost.

In [16]:
# ============================================================================
# PCA on Signature Loadings: PC1 vs PC2 Colored by Ancestry
# ============================================================================

print("="*80)
print("PCA ON SIGNATURE LOADINGS: PC1 vs PC2 Colored by Ancestry")
print("="*80)

from sklearn.decomposition import PCA

# Use average signature loadings over time (not deviations)
# Calculate from thetas: average across time dimension
print("\n1. Calculating average signature loadings over time...")
avg_signature_loadings = thetas[matched_indices].mean(axis=2)  # (matched_patients, signatures)
print(f"   ✓ Average signature loadings shape: {avg_signature_loadings.shape}")

# Get ancestry labels for matched patients
print("\n2. Getting ancestry labels...")
matched_ancestry_labels = ancestry_df.iloc[matched_indices]['rf80'].values

# Handle NaN values
population_labels = []
for p in matched_ancestry_labels:
    if pd.isna(p) or (isinstance(p, float) and np.isnan(p)):
        population_labels.append('Unknown')
    else:
        population_labels.append(str(p))
population_labels = np.array(population_labels)

# Filter to valid populations only
valid_pop_mask = np.isin(population_labels, ['AFR', 'EAS', 'SAS', 'EUR'])
if valid_pop_mask.sum() < len(population_labels):
    print(f"   Filtering out non-population labels...")
    print(f"   Before: {len(population_labels):,} patients")
    avg_signature_loadings = avg_signature_loadings[valid_pop_mask]
    population_labels = population_labels[valid_pop_mask]
    print(f"   After: {len(population_labels):,} patients")

# Check population counts
unique_pops = np.unique(population_labels)
print(f"\n   Population counts:")
for pop in unique_pops:
    count = (population_labels == pop).sum()
    print(f"     {pop}: {count:,}")

# Perform PCA on signature loadings
print("\n3. Performing PCA on average signature loadings...")
pca = PCA(n_components=2)
pca_result = pca.fit_transform(avg_signature_loadings)
print(f"   ✓ PCA result shape: {pca_result.shape}")
print(f"   ✓ Explained variance ratio:")
print(f"     PC1: {pca.explained_variance_ratio_[0]:.4f} ({pca.explained_variance_ratio_[0]*100:.2f}%)")
print(f"     PC2: {pca.explained_variance_ratio_[1]:.4f} ({pca.explained_variance_ratio_[1]*100:.2f}%)")
print(f"     Total: {pca.explained_variance_ratio_.sum():.4f} ({pca.explained_variance_ratio_.sum()*100:.2f}%)")

# Population colors
pop_colors = {
    'AFR': '#1f77b4',  # Blue
    'EAS': '#ff7f0e',  # Orange
    'SAS': '#2ca02c',  # Green
    'EUR': '#d62728',  # Red
}

# Create plot
print("\n4. Creating PC1 vs PC2 plot colored by ancestry...")
fig, ax = plt.subplots(figsize=(12, 10))

# Plot each population separately
for pop in unique_pops:
    pop_mask = population_labels == pop
    if pop_mask.sum() > 0:
        color = pop_colors.get(pop, '#808080')
        ax.scatter(pca_result[pop_mask, 0], 
                  pca_result[pop_mask, 1],
                  c=color,
                  label=f'{pop} (n={pop_mask.sum():,})',
                  alpha=0.6,
                  s=20,
                  edgecolors='none')

# Add reference lines
ax.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.3)
ax.axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.3)

# Labels and title
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)', 
              fontsize=13, fontweight='bold')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)', 
              fontsize=13, fontweight='bold')
ax.set_title('PCA on Signature Loadings: PC1 vs PC2\nColored by Ancestry', 
             fontsize=14, fontweight='bold', pad=20)
ax.legend(loc='best', fontsize=11, framealpha=0.9)
ax.grid(True, alpha=0.3)

plt.tight_layout()

# Save figure
fig_path = OUTPUT_DIR / 'pca_signature_loadings_by_ancestry.png'
plt.savefig(fig_path, dpi=300, bbox_inches='tight')
print(f"\n✓ Saved PCA plot to: {fig_path}")

plt.show()

# Print summary statistics by population
print("\n" + "="*80)
print("SUMMARY: Mean PC1 and PC2 by Population")
print("="*80)
for pop in ['AFR', 'EAS', 'SAS', 'EUR']:
    pop_mask = population_labels == pop
    if pop_mask.sum() > 0:
        mean_pc1 = pca_result[pop_mask, 0].mean()
        std_pc1 = pca_result[pop_mask, 0].std()
        mean_pc2 = pca_result[pop_mask, 1].mean()
        std_pc2 = pca_result[pop_mask, 1].std()
        print(f"\n{pop} (n={pop_mask.sum():,}):")
        print(f"  PC1: {mean_pc1:+.4f} ± {std_pc1:.4f}")
        print(f"  PC2: {mean_pc2:+.4f} ± {std_pc2:.4f}")

print("\n" + "="*80)
print("✓ PCA on signature loadings complete!")
print("="*80)
================================================================================
PCA ON SIGNATURE LOADINGS: PC1 vs PC2 Colored by Ancestry
================================================================================

1. Calculating average signature loadings over time...
   ✓ Average signature loadings shape: (400000, 21)

2. Getting ancestry labels...
   Filtering out non-population labels...
   Before: 400,000 patients
   After: 388,979 patients

   Population counts:
     AFR: 7,012
     EAS: 1,897
     EUR: 372,463
     SAS: 7,607

3. Performing PCA on average signature loadings...
   ✓ PCA result shape: (388979, 2)
   ✓ Explained variance ratio:
     PC1: 0.2679 (26.79%)
     PC2: 0.1927 (19.27%)
     Total: 0.4605 (46.05%)

4. Creating PC1 vs PC2 plot colored by ancestry...

✓ Saved PCA plot to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/pca_signature_loadings_by_ancestry.png
No description has been provided for this image
================================================================================
SUMMARY: Mean PC1 and PC2 by Population
================================================================================

AFR (n=7,012):
  PC1: -0.0003 ± 0.0281
  PC2: -0.0006 ± 0.0240

EAS (n=1,897):
  PC1: -0.0012 ± 0.0280
  PC2: -0.0000 ± 0.0234

SAS (n=7,607):
  PC1: -0.0003 ± 0.0282
  PC2: -0.0003 ± 0.0239

EUR (n=372,463):
  PC1: +0.0000 ± 0.0284
  PC2: +0.0000 ± 0.0240

================================================================================
✓ PCA on signature loadings complete!
================================================================================

7. Create Visualizations for Each Ancestry¶

For each ancestry, plot:

  1. Signatures of interest (e.g., sig 5 for SAS, sig 15 for EAS)
  2. Top variable signatures (5 most variable between populations)

This shows whether ancestry-specific effects increase continuously with ancestry probability (pred).

8. Analyze Variance vs. Ancestry Probability¶

Check if variance decreases with ancestry probability (as expected: more certain ancestry → more consistent genetic background → less variance).

9. Analyzing variance vs. ancestry probability...

   Checking sample size distribution across ancestry probability bins...

   AFR sample sizes by pred bin:
     pred 0.30-0.34: 65 patients
     pred 0.38-0.41: 23 patients
     pred 0.45-0.49: 36 patients
     pred 0.52-0.56: 30 patients
     pred 0.60-0.63: 41 patients
     pred 0.67-0.71: 239 patients
     pred 0.74-0.78: 155 patients
     pred 0.82-0.85: 329 patients
     pred 0.89-0.93: 231 patients
     pred 0.96-1.00: 6,467 patients

   EAS sample sizes by pred bin:
     pred 0.28-0.31: 9 patients
     pred 0.35-0.39: 4 patients
     pred 0.43-0.47: 16 patients
     pred 0.50-0.54: 86 patients
     pred 0.58-0.62: 37 patients
     pred 0.66-0.70: 11 patients
     pred 0.73-0.77: 4 patients
     pred 0.81-0.85: 5 patients
     pred 0.89-0.92: 61 patients
     pred 0.96-1.00: 1,568 patients

   EUR sample sizes by pred bin:
     pred 0.28-0.32: 43 patients
     pred 0.36-0.40: 229 patients
     pred 0.44-0.47: 362 patients
     pred 0.51-0.55: 1,632 patients
     pred 0.59-0.62: 1,466 patients
     pred 0.66-0.70: 932 patients
     pred 0.74-0.77: 899 patients
     pred 0.81-0.85: 348 patients
     pred 0.89-0.92: 1,200 patients
     pred 0.96-1.00: 371,810 patients

   SAS sample sizes by pred bin:
     pred 0.30-0.34: 58 patients
     pred 0.37-0.41: 122 patients
     pred 0.45-0.48: 87 patients
     pred 0.52-0.56: 121 patients
     pred 0.60-0.63: 75 patients
     pred 0.67-0.71: 130 patients
     pred 0.74-0.78: 60 patients
     pred 0.82-0.85: 107 patients
     pred 0.89-0.93: 277 patients
     pred 0.96-1.00: 7,258 patients
   ✓ Saved variance analysis for EAS to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/variance_vs_pred_EAS.png
No description has been provided for this image
   ✓ Saved variance analysis for SAS to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/variance_vs_pred_SAS.png
No description has been provided for this image
   Expected: Variance should DECREASE as pred increases (more certain ancestry → more consistent)
   However, if sample size increases dramatically at high pred, variance may appear higher
   due to better sampling of the true distribution (not necessarily real increased variance).
   Check sample sizes by bin to interpret variance patterns.
8. Creating visualizations...

   Creating figure for AFR...
   ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_AFR.png
No description has been provided for this image
   Creating figure for EAS...
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_6557/1409193948.py:77: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
  plt.tight_layout()
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_6557/1409193948.py:81: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
  plt.savefig(fig_path, dpi=300, bbox_inches='tight')
   ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_EAS.png
/opt/miniconda3/envs/new_env_pyro2/lib/python3.9/site-packages/IPython/core/pylabtools.py:152: UserWarning: Glyph 11088 (\N{WHITE MEDIUM STAR}) missing from font(s) Arial.
  fig.canvas.print_figure(bytes_io, **kw)
No description has been provided for this image
   Creating figure for EUR...
   ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_EUR.png
No description has been provided for this image
   Creating figure for SAS...
   ✓ Saved to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_loadings_by_ancestry_SAS.png
No description has been provided for this image
================================================================================
ANALYSIS COMPLETE
================================================================================

✓ Created 4 figures (one per ancestry)
✓ Analyzed 400,000 patients
✓ Signatures of interest: {'SAS': [5], 'EAS': [15]}
✓ Top 5 most variable signatures: [15  5 19 18  6]

Output directory: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis

Additional Visualization: PC1 vs PC2 Colored by Signature Loading¶

This visualization shows the relationship between genetic ancestry (PC1/PC2) and signature loadings, providing an alternative view to the ancestry probability analysis above.

================================================================================
SIGNATURE LOADING SPACE: Most Variable Signatures Colored by Population
================================================================================

Using discovery thetas (joint estimation) for signature loadings...
✓ Loaded discovery thetas shape: (400000, 21, 52)
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_6557/102893570.py:15: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  discovery_thetas = torch.load(str(discovery_theta_path), map_location='cpu')
✓ Average signature loadings shape: (400000, 21)

Calculating deviations from reference theta...
   ✓ Deviations shape: (400000, 21)

   Note: Filtering out non-population labels (UKB, etc.)
   Before filtering: 400,000 patients
   After filtering: 388,979 patients

   Debug: Population labels unique values: ['AFR' 'EAS' 'EUR' 'SAS']
   Debug: Population label counts:
     AFR: 7,012
     EAS: 1,897
     EUR: 372,463
     SAS: 7,607

   Debug: Deviation ranges for signatures:
     Signature 5: min=-0.036847, max=0.266907, mean=0.003854
     Signature 8: min=-0.060168, max=0.199177, mean=0.000767
     Signature 15: min=-0.008456, max=0.063126, mean=0.000051
     Signature 10: min=-0.014800, max=0.109674, mean=0.000442
     Signature 13: min=-0.011642, max=0.106154, mean=0.000073

   Plotting 5 vs 8: Found populations ['AFR' 'EAS' 'EUR' 'SAS']
     Plotted AFR: 7,012 points, sig1 range=[-0.0345, 0.1658], sig2 range=[-0.0491, 0.1390]
     Plotted EAS: 1,897 points, sig1 range=[-0.0315, 0.1528], sig2 range=[-0.0476, 0.1076]
     Plotted EUR: 372,463 points, sig1 range=[-0.0368, 0.2669], sig2 range=[-0.0602, 0.1992]
     Plotted SAS: 7,607 points, sig1 range=[-0.0326, 0.1741], sig2 range=[-0.0465, 0.1371]

   Plotting 5 vs 15: Found populations ['AFR' 'EAS' 'EUR' 'SAS']
     Plotted AFR: 7,012 points, sig1 range=[-0.0345, 0.1658], sig2 range=[-0.0076, 0.0552]
     Plotted EAS: 1,897 points, sig1 range=[-0.0315, 0.1528], sig2 range=[-0.0078, 0.0382]
     Plotted EUR: 372,463 points, sig1 range=[-0.0368, 0.2669], sig2 range=[-0.0085, 0.0631]
     Plotted SAS: 7,607 points, sig1 range=[-0.0326, 0.1741], sig2 range=[-0.0073, 0.0416]

   Plotting 8 vs 15: Found populations ['AFR' 'EAS' 'EUR' 'SAS']
     Plotted AFR: 7,012 points, sig1 range=[-0.0491, 0.1390], sig2 range=[-0.0076, 0.0552]
     Plotted EAS: 1,897 points, sig1 range=[-0.0476, 0.1076], sig2 range=[-0.0078, 0.0382]
     Plotted EUR: 372,463 points, sig1 range=[-0.0602, 0.1992], sig2 range=[-0.0085, 0.0631]
     Plotted SAS: 7,607 points, sig1 range=[-0.0465, 0.1371], sig2 range=[-0.0073, 0.0416]

   Plotting 10 vs 13: Found populations ['AFR' 'EAS' 'EUR' 'SAS']
     Plotted AFR: 7,012 points, sig1 range=[-0.0129, 0.0837], sig2 range=[-0.0107, 0.0577]
     Plotted EAS: 1,897 points, sig1 range=[-0.0115, 0.0885], sig2 range=[-0.0088, 0.0410]
     Plotted EUR: 372,463 points, sig1 range=[-0.0148, 0.1097], sig2 range=[-0.0116, 0.1062]
     Plotted SAS: 7,607 points, sig1 range=[-0.0143, 0.0916], sig2 range=[-0.0096, 0.0454]

✓ Saved signature space plot to: /Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision/new_notebooks/results/ancestry_analysis/signature_space_by_population.png
No description has been provided for this image
✓ Signature loading space visualization complete!
  - Plotted 388,979 patients
  - Most variable signatures: [5, 8, 15, 10, 13]
  - Key pairs shown: [(5, 8), (5, 15), (8, 15), (10, 13)]
  - Using discovery thetas (joint estimation)
  - Showing deviations from reference, colored by population

================================================================================
SUMMARY: Mean Deviations by Population
================================================================================

AFR (n=7,012):
  Signature 5: +0.0032 ± 0.0234
  Signature 8: +0.0008 ± 0.0189
  Signature 15: +0.0000 ± 0.0036
  Signature 10: +0.0005 ± 0.0071
  Signature 13: +0.0000 ± 0.0045

EAS (n=1,897):
  Signature 5: +0.0030 ± 0.0232
  Signature 8: +0.0013 ± 0.0184
  Signature 15: -0.0000 ± 0.0035
  Signature 10: +0.0007 ± 0.0076
  Signature 13: -0.0001 ± 0.0046

SAS (n=7,607):
  Signature 5: +0.0035 ± 0.0235
  Signature 8: +0.0009 ± 0.0191
  Signature 15: +0.0001 ± 0.0036
  Signature 10: +0.0005 ± 0.0073
  Signature 13: +0.0000 ± 0.0045

EUR (n=372,463):
  Signature 5: +0.0039 ± 0.0241
  Signature 8: +0.0008 ± 0.0190
  Signature 15: +0.0001 ± 0.0036
  Signature 10: +0.0004 ± 0.0071
  Signature 13: +0.0001 ± 0.0047
================================================================================

9. WITH vs WITHOUT PCs Comparison¶

Key Question: Does PC adjustment change biological interpretations?

Answer: No - PC adjustment controls for population stratification while preserving biological signal.

This section compares signature-disease associations (φ) and patient signature loadings (θ) between models trained WITH and WITHOUT PC adjustment.

================================================================================
PHI COMPARISON: WITH PCs vs WITHOUT PCs
================================================================================

This analysis shows that PC adjustment:
  • Controls for population stratification
  • Preserves biological signal (high correlation)
  • Does NOT drastically change disease-signature associations

📊 PHI COMPARISON RESULTS:
   Total points: 380,016
   Correlation: 1.000000
   Mean absolute difference: 0.000077
   ✓ High correlation indicates PC adjustment preserves biological signal
No description has been provided for this image

9b. Trajectory Deviations: WITH vs WITHOUT PCs¶

Compare signature trajectory deviations by ancestry for models WITH and WITHOUT PC adjustment. This shows how PC adjustment affects signature loadings while preserving overall patterns.

/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_6557/4068026007.py:9: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  thetas_nopcs = torch.load(thetas_nopcs_path, map_location='cpu')
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_6557/4068026007.py:10: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  thetas_withpcs = torch.load(thetas_withpcs_path, map_location='cpu')
No description has been provided for this image
✅ Trajectory comparison complete
   Key observation: Patterns are preserved WITH vs WITHOUT PCs
   PC adjustment controls stratification without changing biological interpretations

9c. PC-Induced Shift Heatmap: Which Signatures Shift Most?¶

This heatmap shows which signatures shift most with PC adjustment, justifying why we focus on signatures 5 and 15 for ancestry-specific analyses.

Showing PC-induced shift heatmap for SAS
  (Shift magnitude: 0.0343)
No description has been provided for this image
📊 SIGNATURES WITH LARGEST PC-INDUCED SHIFTS (SAS):
   1. Signature 5: max |Δ| = 0.0343 at time 35 (Δ = +0.0343)
   2. Signature 15: max |Δ| = 0.0119 at time 35 (Δ = +0.0119)
   3. Signature 17: max |Δ| = 0.0088 at time 33 (Δ = -0.0088)
   4. Signature 3: max |Δ| = 0.0084 at time 0 (Δ = -0.0084)
   5. Signature 7: max |Δ| = 0.0083 at time 4 (Δ = +0.0083)

✅ Key finding: Signatures 5 and 15 show largest PC-induced shifts
   This justifies focusing on these signatures for ancestry-specific analyses
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_6557/3608385456.py:15: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  thetas_nopcs = torch.load('/Users/sarahurbut/aladynoulli2/pyScripts/new_thetas_with_sex_nopcs_retrospective.pt').detach().numpy()
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_6557/3608385456.py:16: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  thetas_withpcs = torch.load('/Users/sarahurbut/aladynoulli2/pyScripts/new_thetas_with_pcs_retrospective.pt').detach().numpy()
/var/folders/fl/ng5crz0x0fnb6c6x8dk7tfth0000gn/T/ipykernel_6557/3608385456.py:26: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  ancestry = pd.read_csv(ancestry_path, sep='\t', usecols=['eid', 'rf80']).drop_duplicates('eid')
No description has been provided for this image

Summary and Response¶

Key Findings¶

  1. Ancestry effects are continuous, not binary

    • Signature 5 (SAS): Mean deviation increases from ~0.015 to ~0.036 as ancestry probability increases from 0.3 to 0.95
    • Signature 15 (EAS): Similar continuous increase with ancestry probability
    • Variance patterns: Explained by sample size (larger samples at high pred show full distribution)
  2. PC adjustment preserves biological signal

    • Phi (signature-disease associations): Correlation >0.99 between WITH and WITHOUT PCs
    • Mean difference <0.002 indicates minimal impact on disease-signature associations
    • Trajectory patterns preserved: Signature deviations by ancestry are similar WITH vs WITHOUT PCs
  3. PC-induced shifts identify ancestry-relevant signatures

    • Signatures 5 and 15 show largest PC-induced shifts (heatmap analysis)
    • This justifies focusing on signatures 5 (SAS) and 15 (EAS) for ancestry-specific analyses
    • Other signatures show minimal shifts, confirming PC adjustment targets ancestry-specific variation

Implications¶

  1. Model captures continuous genetic variation, not discrete ancestry groups
  2. Ancestry probability can be used to refine predictions
  3. Effects are polygenic (continuous) rather than monogenic (binary)
  4. PC adjustment controls stratification without changing biological interpretations

Response to Reviewer¶

"We address population stratification through two complementary analyses:

  1. Continuous ancestry effects: We analyze signature loadings as a continuous function of ancestry probability (pred), avoiding harsh thresholding. We find that ancestry effects are continuous (e.g., Signature 5 increases continuously with SAS ancestry probability), demonstrating that signatures capture polygenic architecture rather than discrete ancestry groups.

  2. PC adjustment validation: We compare models WITH and WITHOUT PC adjustment and find that PC adjustment preserves biological signal (phi correlation >0.99, mean difference <0.002) while controlling for population stratification. Signature trajectory patterns by ancestry are preserved, confirming that PC adjustment improves model calibration without changing biological interpretations.

  3. Signature selection rationale: PC-induced shift analysis reveals that signatures 5 and 15 show the largest shifts with PC adjustment, justifying our focus on these signatures for ancestry-specific analyses (Signature 5 for SAS ancestry, Signature 15 for EAS ancestry)."